DATA1220-55, Fall 2024
2024-09-30
The average male height is ~70” (5’10”) and approximately follows the distribution \(N(70, 3)\).
The average female height is ~63” (5’3”) and approximately follows the distribution \(N(63, 3)\).
What is the probability that a random male is taller than the average female but still shorter than the average male?
You can think of this problem as the difference between two percentiles.
The probability that a man is below the male average height but above the female average height is the difference between the percentile for 70” and the percentile for 63” in the distribution of male heights, or about 49%.
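The percentile difference above can be checked with a minimal sketch, assuming Python's standard-library `statistics.NormalDist` and treating \(N(70, 3)\) as mean 70 and standard deviation 3. The variable names are illustrative only.

```python
from statistics import NormalDist

# Male heights in inches, modeled as N(mean=70, sd=3) as stated above.
male_heights = NormalDist(mu=70, sigma=3)

# Percentile (cumulative probability) at each cutoff.
pct_at_male_mean = male_heights.cdf(70)    # 0.5 by symmetry
pct_at_female_mean = male_heights.cdf(63)  # share of men shorter than 63"

# Probability of falling between the two cutoffs.
prob_between = pct_at_male_mean - pct_at_female_mean
print(round(prob_between, 4))  # 0.4902
```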
5.1: Point estimates and sampling variability
5.2: Confidence intervals for a proportion
5.3: Hypothesis testing for a proportion
Why do we sample?
Easier, cheaper, more convenient, etc.
Hard to do a census
What are the downsides of sampling?
May not represent study/target population
Methods might introduce bias
Results could be due to chance
How certain are we that our data is representative?
Bureau of Justice Statistics September 2021 Special Report on Recidivism of Prisoners
Sampled 73,600 prisoner records from the National Corrections Reporting Program on 409,300 state prisoners released across 24 states in 2008, representing 69% of all persons released from state prisons
89% of sampled prisoners were male, and 11% of sampled prisoners were female.
42.9% of sampled prisoners were rearrested within 1 year. 43.9% of male sampled prisoners were rearrested within 1 year. 34.4% of female sampled prisoners were rearrested within 1 year.
| row | id | sex | reoffend |
|---|---|---|---|
| 65693 | prisoner65693 | Female | Rearrested |
| 32933 | prisoner32933 | Male | Not Rearrested |
| 52929 | prisoner52929 | Male | Not Rearrested |
| 2813 | prisoner02813 | Male | Rearrested |
| 4611 | prisoner04611 | Male | Rearrested |
| 72046 | prisoner72046 | Female | Not Rearrested |
| 22816 | prisoner22816 | Male | Rearrested |
| 65582 | prisoner65582 | Female | Rearrested |
| 53751 | prisoner53751 | Male | Not Rearrested |
| 63092 | prisoner63092 | Male | Not Rearrested |
| sex | Rearrested | Not Rearrested | Total |
|---|---|---|---|
| Male | 28756 | 36748 | 65504 |
| Female | 2785 | 5311 | 8096 |
| Total | 31541 | 42059 | 73600 |
Is a prisoner being rearrested within 1 year of their release independent of their sex?
Remember: when 2 events are independent, \(P(B) \approx P(B|A) \approx P(B|A')\)
\[ \begin{aligned} P(\text{rearrested and male}) &= P(\operatorname{male}) \times P(\operatorname{rearrested}|\operatorname{male}) \\ &\approx P(\operatorname{male}) \times P(\operatorname{rearrested}) \end{aligned} \]
\(P(\operatorname{male}) = 0.89\)
| sex | Rearrested | Not Rearrested | Total |
|---|---|---|---|
| Male | 0.912 | 0.874 | 0.89 |
| Female | 0.088 | 0.126 | 0.11 |
| Total | 1.000 | 1.000 | 1.00 |
\(P(\operatorname{rearrested}) = 0.429\)
| sex | Rearrested | Not Rearrested | Total |
|---|---|---|---|
| Male | 0.439 | 0.561 | 1 |
| Female | 0.344 | 0.656 | 1 |
| Total | 0.429 | 0.571 | 1 |
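Does our observed joint probability \(P(\text{rearrested and male}) = \frac{28756}{73600} = 0.391\) approximate the probability under an independence model, \(P(\operatorname{rearrested}) \times P(\operatorname{male})\)? Note that 0.439 is the conditional probability \(P(\operatorname{rearrested}|\operatorname{male})\), not the joint probability.
\[ \begin{aligned} P(\operatorname{rearrested}) \times P(\operatorname{male}) &= 0.429 \times 0.890 \\ &= 0.382 \end{aligned} \]
Does \(0.391 \approx 0.382\)? Equivalently, does \(P(\operatorname{rearrested}|\operatorname{male}) = 0.439 \approx P(\operatorname{rearrested}) = 0.429\)?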
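The independence comparison can be reproduced directly from the counts in the contingency table. This is a minimal sketch with illustrative variable names; it uses unrounded proportions, so the third decimal of the independence product may differ slightly from a calculation that starts from the rounded 0.429 and 0.890.

```python
# Counts copied from the sex-by-rearrest table above.
counts = {
    ("Male", "Rearrested"): 28756,
    ("Male", "Not Rearrested"): 36748,
    ("Female", "Rearrested"): 2785,
    ("Female", "Not Rearrested"): 5311,
}
total = sum(counts.values())

male_total = counts[("Male", "Rearrested")] + counts[("Male", "Not Rearrested")]
rearrested_total = counts[("Male", "Rearrested")] + counts[("Female", "Rearrested")]

# Marginal probabilities.
p_male = male_total / total
p_rearrested = rearrested_total / total

# Conditional probability of rearrest among men.
p_rearrested_given_male = counts[("Male", "Rearrested")] / male_total

# Observed joint probability vs. the joint probability implied by independence.
p_joint_observed = counts[("Male", "Rearrested")] / total
p_joint_independent = p_male * p_rearrested

print(f"P(male)                      = {p_male:.3f}")
print(f"P(rearrested)                = {p_rearrested:.3f}")
print(f"P(rearrested | male)         = {p_rearrested_given_male:.3f}")
print(f"P(rearrested and male)       = {p_joint_observed:.3f}")
print(f"P(rearrested) x P(male)      = {p_joint_independent:.3f}")
```

Whether a gap of this size is meaningful or just sampling noise is the question the rest of the class addresses.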
We are interested in population-level parameters
Complete populations are difficult (or impossible) to collect data from
The sample statistic can be used as a point estimate for a population parameter in sufficiently large samples
| Measure | Parameter | Statistic |
|---|---|---|
| Mean | \(\mu\) | \(\bar{x}\) |
| Proportion | \(p\) | \(\hat{p}\) |
| Difference in Means | \(\mu_1-\mu_2\) | \(\bar{x}_1-\bar{x}_2\) |
| Difference in Proportions | \(p_1-p_2\) | \(\hat{p}_1-\hat{p}_2\) |
| Standard Deviation | \(\sigma\) | \(s\) |
| Correlation | \(\rho\) | \(r\) |
A sampling distribution is a distribution of sample statistics for different samples of the same size from the same population
Not actually observed in the real world; it is a “hypothetical” distribution over repeated samples
Sampling distributions also have a location and a scale
The Law of Large Numbers for Binary Outcomes: Coin Tosses
Sampling distribution of sample proportions of coin flips by sample size.
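A simulation along these lines can stand in for the figure. This is a minimal sketch, assuming a fair coin, 2,000 repeated samples per sample size, and an arbitrary random seed; none of these settings come from the slides.

```python
import random
from statistics import mean, stdev

random.seed(1220)  # arbitrary seed so the run is repeatable

def sample_proportion(n, p=0.5):
    """Proportion of heads in n simulated tosses of a coin with P(heads) = p."""
    return sum(random.random() < p for _ in range(n)) / n

# For each sample size, build a simulated sampling distribution of p-hat
# and record its center and spread.
spread_by_n = {}
for n in (10, 100, 1000):
    props = [sample_proportion(n) for _ in range(2000)]
    spread_by_n[n] = stdev(props)
    print(f"n={n:5d}  mean of p-hat={mean(props):.3f}  sd of p-hat={spread_by_n[n]:.3f}")
```

The center stays near 0.5 at every sample size, while the spread shrinks as \(n\) grows.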
What are sources of error we COULD know?
What are sources of error we CAN’T know?
A distribution of sample means approximates a normal distribution as the sample size gets larger
Requires independent, identically distributed (i.i.d.) variables
Each observation is independent of the next
Each observation comes from the same source distribution
Sample size must be sufficient for estimates to be valid
Standard error (SE) is the standard deviation of the sampling distribution of a sample statistic
Describes the scale (i.e. variability, sampling error) of the sampling distribution
For a distribution of sample means, \(SE=\frac{\sigma}{\sqrt{n}}\)
For a distribution of sample proportions, \(SE=\sqrt{\frac{p(1-p)}{n}}\)
As \(n\) increases, the standard error \(SE\) decreases.
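The two formulas translate directly into code. This is a minimal sketch; the function names and the example values (\(\sigma = 10\), \(p = 0.5\)) are illustrative assumptions.

```python
from math import sqrt

def se_mean(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

# Quadrupling n halves the standard error in both cases.
for n in (25, 100, 400):
    print(n, se_mean(10, n), se_proportion(0.5, n))
```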
For a binary category, if you coded the 2 classes as \(0\) and \(1\), the sample mean \(\bar{x}\) equals the sample proportion \(\hat{p}\) of \(1\)’s.
Example: You flip a coin 5 times, and it’s tails 3 times.
\[ \begin{aligned} \hat{p}&= \frac{\operatorname{count}(\operatorname{tails})}{\operatorname{count}(\operatorname{flips})} \\ &= \frac{3}{5} \end{aligned} \]
\[ \begin{aligned} \bar{x}&=\frac{\operatorname{sum}(0, 0, 1, 1, 1)}{n} \\ &=\frac{3}{5} \end{aligned} \]
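The same coin example in code, as a minimal sketch that assumes heads is coded as 0 and tails as 1:

```python
# Five flips, coded heads = 0 and tails = 1 (three tails).
flips = [0, 0, 1, 1, 1]

p_hat = flips.count(1) / len(flips)  # proportion of 1's
x_bar = sum(flips) / len(flips)      # mean of the 0/1 values

print(p_hat, x_bar)  # 0.6 0.6
```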
\[ \hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \]
Need a sufficiently large sample size!
\(np > 10\)
\(n(1-p) > 10\)
Need i.i.d. samples
In this population, \(p=0.05\) and we take random samples of size \(n=50\). Will the distribution of sample proportions be nearly normal?
\(np=50 \times 0.05 = 2.5\), so the sample size conditions are not met. This distribution is not nearly normal.
The assumptions of the central limit theorem fail when the proportion of successes and/or sample size are small.
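The sample size conditions can be wrapped in a small helper. This is a minimal sketch; the function name `success_failure` and the second example (\(n = 500\)) are illustrative assumptions, while the threshold of 10 follows the conditions above.

```python
def success_failure(n, p, threshold=10):
    """Return expected successes, expected failures, and whether both exceed the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    conditions_met = expected_successes > threshold and expected_failures > threshold
    return expected_successes, expected_failures, conditions_met

# The example above: p = 0.05 with n = 50 fails because np is only 2.5.
print(success_failure(50, 0.05))

# With the same p, a larger sample (hypothetical n = 500) passes.
print(success_failure(500, 0.05))
```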
\[ \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \]
Need a sufficiently large sample size!
\(n>30\) is usually sufficient when the underlying distribution is not strongly skewed (if it is exactly normal, any \(n\) works)
Need larger \(n\) when underlying distribution is skewed
Need i.i.d. samples
If our population has the distribution \(N(30, 10)\) and we take random samples of size \(n=20\), will the distribution of sample means be nearly normal with the sampling distribution \(N(30, \frac{10}{\sqrt{20}})\)?
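Because the population here is itself normal, the sample mean is normally distributed for any \(n\), so the answer is yes. A simulation can check the stated standard error; this is a minimal sketch, assuming 5,000 repeated samples and an arbitrary seed.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(20)  # arbitrary seed so the run is repeatable

mu, sigma, n = 30, 10, 20
theoretical_se = sigma / sqrt(n)  # 10 / sqrt(20), about 2.236

# Draw 5,000 samples of size n from N(30, 10) and keep each sample's mean.
sample_means = [
    mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(5000)
]

simulated_center = mean(sample_means)
simulated_se = stdev(sample_means)

print(f"theoretical SE = {theoretical_se:.3f}")
print(f"simulated mean of x-bar = {simulated_center:.3f}")
print(f"simulated SD of x-bar   = {simulated_se:.3f}")
```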
We use sample statistics to describe study populations and estimate parameters using sampling distributions
We also describe the variability of our measure and quantify our uncertainty regarding our estimate
We use the overlap between theoretical distributions to decide how meaningful the differences between groups are
Study in Denmark of 4,931,775 individuals ages 12 or older from 2020-10-01 to 2021-10-05
3,482,295 individuals in the study were vaccinated. Of the 269 individuals who developed myocarditis, 48 were vaccinated.
Does the probability of developing myocarditis vary by whether or not someone has been vaccinated against COVID-19?
If 269 of the 4,931,775 individuals sampled developed myocarditis regardless of vaccination status, what is our point estimate (i.e. sample statistic) for the probability of developing myocarditis in our study population?
\[ \begin{aligned} \hat{p}&=\frac{\operatorname{count}(\operatorname{cases})}{\operatorname{count}(\operatorname{subjects})} \\ &= \frac{269}{4,931,775} \\ &= 0.00005 \end{aligned} \]
If 269 of the 4,931,775 individuals sampled developed myocarditis regardless of vaccination status, what is the standard error of our sample statistic \(\hat{p}=0.00005\), the probability of developing myocarditis in our study population?
\[ \begin{aligned} SE_{\hat{p}} &= \sqrt{\frac{p(1-p)}{n}} \\ &= \sqrt{\frac{0.00005(1-0.00005)}{4,931,775}} \\ &= 0.000003 \end{aligned} \]
The proportion of all individuals in the study population who develop myocarditis has a sampling distribution of \(N(0.00005, 0.000003)\).
If my sample proportion has the theoretical sampling distribution \(\hat{p} \sim N(0.00005, 0.000003)\), in what range do 95% of the theoretical sample proportions occur?
95% Confidence Interval = \((0.000044, 0.000056)\)
95% of the theoretical sample proportions in this sampling distribution are between 0.000044 and 0.000056.
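The point estimate, standard error, and 95% range can be computed as follows. This is a minimal sketch, assuming the normal approximation and a two-sided 95% critical value from `statistics.NormalDist`. It shows the interval twice: once from the rounded values used above (0.00005 and 0.000003) and once carrying full precision, which shifts the endpoints slightly.

```python
from math import sqrt
from statistics import NormalDist

cases, n = 269, 4_931_775

# Point estimate and standard error at full precision.
p_hat = cases / n
se = sqrt(p_hat * (1 - p_hat) / n)

# Two-sided 95% critical value (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Interval from the rounded values shown above.
lo_rounded = 0.00005 - z * 0.000003
hi_rounded = 0.00005 + z * 0.000003

# Interval from the unrounded estimate and standard error.
lo_full = p_hat - z * se
hi_full = p_hat + z * se

print(f"p-hat = {p_hat:.7f}, SE = {se:.7f}")
print(f"rounded inputs: ({lo_rounded:.6f}, {hi_rounded:.6f})")
print(f"full precision: ({lo_full:.6f}, {hi_full:.6f})")
```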
If 48 of the 3,482,295 vaccinated individuals developed myocarditis, what is our point estimate (i.e. sample statistic) for the probability of developing myocarditis in our vaccinated population?
\[ \begin{aligned} \hat{p}&=\frac{\operatorname{count}(\text{vaccinated cases})}{\operatorname{count}(\text{vaccinated subjects})} \\ &= \frac{48}{3,482,295} \\ &= 0.00001 \end{aligned} \]
How unusual is this sample statistic given our population parameter?
What’s the probability of observing a sample proportion as small as or smaller than 0.00001?
If 221 of the 1,449,480 unvaccinated individuals developed myocarditis, what is our point estimate (i.e. sample statistic) for the probability of developing myocarditis in our unvaccinated population?
\[ \begin{aligned} \hat{p}&=\frac{\operatorname{count}(\text{unvaccinated cases})}{\operatorname{count}(\text{unvaccinated subjects})} \\ &= \frac{221}{1,449,480} \\ &= 0.00015 \end{aligned} \]
How unusual is this sample statistic given our population parameter?
What’s the probability of observing a sample proportion greater than 0.00015?
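Both tail questions can be approached with z-scores against the overall sampling distribution. This is a minimal sketch, assuming the rounded sampling distribution \(N(0.00005, 0.000003)\) and the rounded group estimates from above; variable names are illustrative.

```python
from statistics import NormalDist

# Sampling distribution for the whole study population (rounded values from above).
sampling_dist = NormalDist(mu=0.00005, sigma=0.000003)

p_vaccinated = 0.00001
p_unvaccinated = 0.00015

# How many standard errors each group estimate sits from the overall estimate.
z_vaccinated = (p_vaccinated - sampling_dist.mean) / sampling_dist.stdev
z_unvaccinated = (p_unvaccinated - sampling_dist.mean) / sampling_dist.stdev

# Lower-tail probability for the vaccinated estimate,
# upper-tail probability for the unvaccinated estimate.
prob_as_small = sampling_dist.cdf(p_vaccinated)
prob_larger = 1 - sampling_dist.cdf(p_unvaccinated)

print(f"z (vaccinated)   = {z_vaccinated:.2f}, lower tail = {prob_as_small:.3g}")
print(f"z (unvaccinated) = {z_unvaccinated:.2f}, upper tail = {prob_larger:.3g}")
```

Both estimates sit more than 13 standard errors from the overall estimate, so under this model either tail probability is effectively zero.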
DATA1220-55 Fall 2024, Class 14 | Updated: 2024-09-30